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Abstract 

Quadratic regression (QR) models naturally extend linear models by considering 
interaction effects between the covariates. To conduct model selection in QR, it is 
important to maintain the hierarchical model structure between main effects and in¬ 
teraction effects. Existing regularization methods generally achieve this goal by solving 
complex optimization problems, which usually demands high computational cost and 
hence are not feasible for high dimensional data. This paper focuses on scalable regular¬ 
ization methods for model selection in high dimensional QR. We first consider two-stage 
regularization methods and establish theoretical properties of the two-stage LASSO. 

Then, a new regularization method, called Regularization Algorithm under Marginal- 
ity Principle (RAMP), is proposed to compute a hierarchy-preserving regularization 
solution path efficiently. Both methods are further extended to solve generalized QR 
models. Numerical results are also shown to demonstrate performance of the methods. 
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1 Introduction 


Statistical models involving two-way or higher-order interactions have been studied in 
various contexts, such as linear models and generalized linear models (Nelder, 1977; McCul- 
lagh & Nelder, 1989), experimental design (Hamada & Wu, 1992; Chipman et ah, 1997), 
and polynomial regression (Peixoto, 1987). In particular, a quadratic regression (QR) model 
formulated as 

b = /3o + AlAi + • • ■ + (5 p X p + + f3i^XiX2 + • • • + /3p, p X; + e (1) 

has been considered recently to analyze high dimensional data. In (1), A],..., X p are main 
effects, and order-2 terms XjX k (1 < j < k < p) include quadratic main effects (j = k ) and 
two-way interaction effects (j ^ k). A key feature of model (1) is its hierarchical structure, 
as order-2 terms are derived from the main effects. To reflect their relationship, we call XjX^ 
the child of Xj and A*,, and A j and A*, the parents of XjX^. 

Standard techniques such as ordinary least squares can be applied to solve (1) for a small 
or moderate p. When p is large and variable selection becomes necessary, it is suggested 
that the selected model should keep the hierarchical structure. That is, interaction terms 
can be selected into the model only if their parents are in the model. This is referred to the 
marginality principle (Nelder, 1977). In general, a direct application of variable selection 
techniques to (1) can not automatically ensure the hierarchical structure in the final model. 
Recently, several regularization methods (Zhao et ah, 2009; Yuan et ah, 2009; Choi et ah, 
2010; Bien et ah, 2013) have been proposed to conduct variable selection for (1) under the 
marginality principle by designing special forms of penalty functions. These methods are 
feasible when p is a few hundreds or less, and the resulting estimators have oracle properties 
when p = o(n ) (Choi et ah, 2010). However, when p is much larger, these methods are not 
feasible since their implementation requires storing and manipulating the entire 0(p 2 ) x n 
design matrix and solving complex constrained optimization problems. The memory and 
computational cost can be extremely high and prohibitive. 

In this paper, we study regularization methods on model selection and estimation for 
QR and generalized quadratic regression (GQR) models under the marginality principle. 
The main focus is the case p 3> n, which is a bottleneck for the existing regularization 
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methods. We study theoretical properties of a two-stage regularization method based on 
the LASSO and propose a new efficient algorithm, RAMP, which produces a hierarchy¬ 
preserving solution path. In contrast to existing regularization methods, these procedures 
avoid storing 0(p 2 ) X n design matrix and sidestep complex constraints and penalties, making 
them feasible to analyze data with many variables. In particular, our R package RAMP runs 
well on a desktop for data with n = 400 and p = 10 4 and it takes less than 30 seconds (with 
CPU 3.4 GHz Intel Core i7 and 32GB memory) to fit the QR model and get the whole solution 
path. The main contribution of this paper is threefold. First, we establish a variable selection 
consistency result of the two-stage LASSO procedure for QR and offer new insights on stage- 
wise selection methods. To our best knowledge, this is the first selection consistency result for 
high dimensional QR. Second, the proposed algorithms are computationally efficient and will 
make a valuable contribution to interaction selection tools in practice. Third, onr methods 
are extended to interaction selection in GQR models, which are rarely studied in literature. 

We define notations used in the paper. Let X = (xi,..., x„) T be the n x p design matrix 
of main effects and y = (yi, ...,y n ) T be the n-dimensional response vector. The linear term 
index set is Ai — {1,2, and the order-2 index set is X = {(j,k) : 1 < j < k < p). 

The regression coefficient vector (3 = (/3 0 , 0m, 0x ) T , where (3m = (/A, •••, (3 P ) T and (3x = 
(/3 1 , 1 , /3\ t 2, ■■., /3 P , P ) T ■ For a subset A C M., use (3a for the subvector of (3m indexed in A, and 
X _4 for the submatrix of X whose columns are indexed in A. In particular, Xj is the jth 
column of X. We treat the subscripts (j, k ) and (k,j) as identical, i.e., (3j^ = (3k.j- Let C\, c 2 , 
... and Ci, C‘ 2 , ... be positive constants which are independent of the sample size n. They are 
locally dehned and their values may vary in different context. For a vector v = (iq, ...,t^) T , 
ll v ll = \jY^j =i v j and ||v||i = Y!j=i \ v j\- Lor a matrix A, deffiie HAHoo = max* Y^j |Aj| and 
||A|| 2 = sup|| v || 2=1 11Av11 2 as the standard operator norm, i.e., the square root of the largest 
eigenvalue of A T A. 

The rest of the paper is organized as follows. Section 2 considers two-stage regularization 
methods for model selection in QR and studies theoretical properties of the two-stage LASSO. 
Section 3 proposes the RAMP to compute the entire hierarchy-preserving solution path 
efficiently. Section 4 discusses the generalizations of the proposed methods to GQR models. 
Section 5 presents numerical studies, followed by a discussion. Technical proofs are in the 
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Appendix. 


2 Two-stage Regularization Method 

Variable selection and estimation via penalization is popular in high dimensional analy¬ 
sis. Examples include the LASSO (Tibshirani, 1996), SCAD (Fan & Li, 2001), clastic net 
(Zou & Hastie, 2005), minimax concave penalty (MCP) (Zhang, 2010), among many others. 
Properties such as model selection consistency and oracle properties have been verified (Zhao 
& Yu, 2006; Wainwright, 2009; Fan & Lv, 2011). A general penalized estimator for linear 
models is defined as 

1 p 

(Po, Pm) = argmin— ||y - 1 (3 0 - X/3^|| 2 + V J* (#,■), (2) 

Mm) 2n 

where y is the response vector, X is the design matrix, J\{-) is a penalty function, and A > 0 
is a regularization parameter. The penalty X(-) and A may depend on index j. For easy 
presentation, we use same penalty function and parameter for all j unless stated otherwise. 

We consider the problem of variable selection for QR model (1). Define X° 2 = X o X as 
an n x p< ' p ^ 1 1 matrix consisting of all pairwise column products. That is, for X = (X 1; ..., X p ), 
X° 2 = XoX = (XiXX^XiXX^.^Xp *X p ), where * denotes the entry-wise product of two 
column vectors. For an index set A C A4, define *4° 2 = AoA = {(j, k) : j < k ; j, k e A} C X, 
and A o M. — {(j, k) : j < k\ j or k G A} C X. We use X^ 2 as a short notation for (X^)° 2 , 
a matrix whose columns are indexed by *4° 2 . 

Two-stage regularization methods for interaction selection have been considered in Efron 
et al. (2004); Wu et al. (2009), among others. However, their theoretical properties are not 
clearly understood. In the following, we first illustrate the general two-stage procedure for 
interaction selection. 

Two-stage Regularization Method-. 

Stage 1: Solve (2). Denote the selected model by A = {j : Pj ± Q ,J = 
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Stage 2: Solve 


P 


argmin ^~|| y - l^o - - x f/3> 


+ ^ J\{Pa), 

aGA ° 2 


At Stage 1, only main effects are considered for selection, with all the order-2 terms being 
left out of the model. Denote the selected set by A. At Stage 2, we expand A by including 
all the two-way interactions of those main effects within A and fit the new model. To keep 
the hierarchical structure, we do not penalize main effects at Stage 2, i.e., set J\{-) = 0 for 
j e A. In order to keep the hierarchy, it is also possible to use other methods (Zhao et ah, 
2009; Yuan et ah, 2009; Choi et al., 2010; Bien et ah, 2013) at Stage 2. 

One main advantage of this two-stage regularization procedure is its simple implementa¬ 
tion. Existing R packages lars and glranet can be directly used to carry out the procedure. 
Stage 1 serves as a dimension reduction step prior to Stage 2, so the two-stage method avoids 
estimating 0(p 2 ) parameters altogether, making the procedure feasible for very large p. 

In spite of its computational advantages, theoretical properties of two-stage regulariza¬ 
tion methods are seldom studied in literature. A commonly raised concern is whether the 
important main effects can be consistently identified at Stage 1, when all order-2 terms are 
left out of the model on purpose. Next, we focus on the two-stage LASSO method and inves¬ 
tigate its selection behavior at Stage 1. In particular, we establish the main-effect selection 
consistency result of the two-stage LASSO for QR under some regularity conditions. 

The LASSO is a special case of (2) by using the i\ penalty 

1 p 

0ol,Pl ) = argmin — ||y - l/3 0 - X/3m|| 2 + V X\/3j\. 

(0o,l3 m) 2n 

In the following, we show that the LASSO solution f3 L is sign consistent at Stage 1, i.e., 
sign(/3i) = sign(/3v() with an overwhelming probability for a properly chosen tuning param¬ 
eter. This result provides critical theoretical insight about the two-stage LASSO estimator. 

Consider a sparse quadratic model with a Gaussian design. Assume that Xj, 1 < i < n, 
are independent and identically distributed (i.i.d.) from A/”(0, £), and 

Vi = A) + xj p M + (xT ) o2 pi + £ i7 (3) 
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where e = (ey, ...,£ n ) T ~ J\f( 0, <r 2 I) is independent of {xj}” =1 . Without loss of generality, we 
further center y t and (xj ) 02 and write 

Vi = *7 Pm + u 7 Pi + ( 4 ) 

where y i is the centered response and u/" = (x^) 02 — E(x7)° 2 is a p x (p + l)/2 row vector 
with all centered order-2 terms. Let y Mi = xj (3 M and y Xi = u J fa. y M = (y Mh , ...,y M n) T , 
yi = {iJta,-. -lift,,) ■ U = (ui, ...,u n ) T . Set r 2 = Var(y Xi ). Dehne t <7 = u J f3 x + £i and 
<jj = (ay, ...,o; n ) T , which is treated as noise at Stage 1. Denote by the submatrix of E 
with row index A and column index B. As illustrated in Hao & Zhang (2014b), the support 
and sign of the coefficient vector (3 M for a QR model depend on its parametrization because 
a coding transformation can change the support of (3 m- Therefore, we follow Hao & Zhang 
(2014b) and dehne the index set of important main effects by S = {j : Pj + ELi Pj,k > °}- 
Let s = |«S| and T = {( k ,£) : (Ax p 1 0}- It follows this dehnition that T C S°' 2 . Moreover, in 
order to make sign(/Tvf) wcll-dehned, we require that main effects are centered in (3). We 
refer to Hao & Zhang (2014b) for further explanations on the well-dehnedness of sign and 
support of the coefficient vector / 3m f° r a QR model. 

Dehne E^qs = E^c — T,sos('^‘Ss)~ 1 ’^‘SS c where S c = M.—S. Let A min (A) be the smallest 
eigenvalue of A and p u (A) = max,; A ti . Assume the following technical conditions: 


(Cl) (Irrepresentable Condition) UEsc^Ess) 1 || 00 < 1 — 7, 7 € (0,1]. 
(C2) (Eigenvalue Condition) A m i n (Ess) > C' m in > 0. 


Theorem 1 Consider the quadratic model with a random Gaussian design (4)- Suppose 
that (C1)-(C2) hold. Consider the family of regularization parameters 


XnfPp) — 


1 PpPui^S^s) 4(<J 2 + T 2 ) log(p) 


(5) 


7^ n 

for some <f p > 2. If for some fixed 5 > 0, the sequence (■ n,p,s ) and regularization sequence 
{A n } satisfy 

n 


2s log(p — s) 




C ■ o/ 2 

'- y mm / 


XI: 


( 6 ) 


then the following holds with probability greater than 1 — <y exp(—c 2 min{.s, log(p — s), n?}). 
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1. The LASSO has a unique solution (3 l with support contained within S. 

2. Define the gap 



i 

2 / 

di^n) QjA n 

V 2 
u ss 

«A 20 V; 


a 2 s 


911 112 

Cmin n C m \„n3 


+ 


Then if /3 min = unn jeS \Pj\ > d{K), then sign(/3 L ) = sign(/3 >( ). 


( 7 ) 


Furthermore, given (5), an alternative condition to (6) making the above results hold is 

n 1 + S' p u {Z S c\s) 

2s log (p - s) > 1 - j- C min 7 2 

for some S' > 0. 


Remark 1. Conditions (C1)-(C2) are commonly used to show model selection consistency 
of the LASSO estimator in the literature. Conditions (6) and (7) are key requirements on 
dimensionality and minimal signal strength /3 m i n , respectively. The normality assumption 
is used here to facilitate proof and comparison to existing results in linear regression. In 
the supplementary material, we establish Theorem 2 which extends the consistency result 
to the non-Gaussian designs. Other possible extensions of theoretical results are discussed 
in Section 6. 

Remark 2. The result in Theorem 1 generalizes Theorem 3 in Wainwright (2009) that is 
established in the context of linear regression. Theorem 1 implies that the two-stage LASSO 
can identify important main effects at Stage 1. The validity of the two-stage LASSO is 
then guaranteed as the index set of important interactions T C S° 2 . That is, all important 
interaction effects can be included at Stage 2. Given the result of Theorem 1, the interaction 
selection consistency result of Stage 2 can be obtained under some mild conditions on the 
matrix X^ 2 , since the data dimensionality has been greatly reduced. One can also apply 
existing methods, e.g., Choi et al. (2010) at Stage 2, for which the selection consistency has 
been established. 
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3 Regularization Path Algorithm under Marginality 
Principle (RAMP) 

For linear regression models, regularization solution-path algorithms provide state-of-the- 
art computational tools to implement variable selection with high dimensional data. Popular 
algorithms include least angle regression (LARS) (Efron et al., 2004), its extensions (Park 
& Hastie, 2007; Wu, 2011; Zhou & Wu, 2014), and coordinate decent algorithm (CDA) 
(Friedman et al., 2007; Wu & Lange, 2008; Friedman et al., 2010; Yu & Feng, 2014). These 
computational tools can be used to implement two-stage methods for fitting QR. However, 
by the nature of two-stage approach, the whole solution-path highly depends on the selection 
result at Stage 1, which is obtained under considerably high noise level if interaction effects 
are strong. Therefore, it is desirable to develop a seamless path algorithm which can select 
main and interaction effects simultaneously while keeping the hierarchy structure. To achieve 
the goal, we propose a Regularization Algorithm under Marginality Principle (RAMP) via 
the coordinate descent to compute the solution path while preserving the model hierarchy 
along the path. 

We first review the coordinate decent algorithm for the standard LASSO. Consider 

1 n 

min 2 ^ ~ *JPm) 2 + A||Am||i. 

i= 1 

There exists a penalty parameter A max such that the minimizer (3l = 0 if A > A max . As A 
decreases from A max to 0, the LASSO solution f3 L = f3\ changes from 0 to the least squares 
estimator (if it exists). Usually, a sequence of values {Afc}f =1 between A max and ()A max is set, 
with 0 < C < 1, and a solution path (3\ k is calculated for each A*,. For a fixed k , using (3\ k _ 1 
as the initial value, the CDA solves the optimization problem by cyclically minimizing each 
coordinate (3j until convergence. Define A4 k = supp{{3\ k }, i.e., the active set for each X k . 

In the following, we propose a coordinate descent algorithm to fit the quadratic model 
under regularization which obeys the marginality principle. Given a tuning parameter A, the 
algorithm computes the l\ regression coefficients of main effects and interactions subject to 
the heredity condition. At step k — 1, denote the current active main effect set as M . k -i and 



the interaction effect set as Ik- 1 - Define 'Hk-i as the parent set of Ik- 1 , i.e., it contains the 
main effect which has at least one interaction effect (child) in I}.-\- Set H'k-i = Ai — Hk-i- 

Regularization Algorithm under Marginality Principle (RAMP): 

Initialization-. Set A max = n _1 max|X T y| and A min = ()A max with some small ( > 0. 
Generate an exponentially decaying sequence A max = Ai > A 2 > • • • > \k = A m i n . Initialize 
the main effect set Aio = 0 and the interaction effect set X 0 = 0. 

Path-building : Repeat the following steps for k — 1, • • • , K. Given Aik-i,Ik-i,Pk-i, 
add the possible interactions among main effects in Aik -1 to the current model. Then with 
respect to (/do, 0m, /3^ 0 2 ) t , we minimize 

\ n 2 

7^ ^ (yi ~ A) ~ x 7 Am ~ ( x 7 )m*-, Aug 2 ,) + Afc||/3^c_J|i + AfeH/3^02 Jliy (9) 

i =1 

where the penalty is imposed on the candidate interaction effects and which contains 

the main effects not enforced by the strong heredity constraint. Record Aik, Ik and Pk 
according to the solution. Add the corresponding main effects from into Aik to enforce 
the heredity constraint, and calculate the OLS based on the current model. 

Compared with the two-stage approach, the RAMP allows at each step the interaction 
effects Ad°k_i to enter the model for selection. Following the same strategy, we propose a 
weak hierarchy version of RAMP, denoted by RAMP-w, as a flexible relaxation. The main 
difference is that we use the set A4k-i°A4 instead of Ai°k_i in (9) and solve the optimization 
problem with respect to (/3 0 , Pm, ^ dear that an interaction effect can enter 

the model for selection as long as one of its parents is selected in a previous step. Therefore, 
it is helpful in the scenario when only one of the parents of an important effect is strong. 
Both versions of RAMP are implemented in our R package RAMP, which is available on the 
GRAN web site for researchers to use. Moreover, other penalty options such as SCAD and 
MCP are also included in the RAMP package. 

Figure 1 illustrates two hierarchy-preserving solution paths obtained by the RAMP under 
strong and weak heredity constraints, respectively. In this toy example, n = 500, p = 100, 


9 



and X %3 *~ d ' _A/”(0,1), and Y = X\ + 3X 6 + 4XjX 3 + 5 X 3 X 0 + e, where e ~ A^(0,1). Without 
the marginality principle, the interaction term X\X^ would be the most significant predictor 
as it has the highest marginal correlation with the response Y. On the other hand, RAMP 
with the strong heredity selects X x and X 6 before picking up X\X§ on the solution path. 
Note that RAMP does not select the interaction X ] X :i until at a very late stage on the 
solution path due to the strong heredity assumption. Under the weak heredity assumption, 
the RAMP-w is able to select in sequence X 6 , AbX6, Xi and X 1 X 3 . The reason is that 
after X 6 is selected, X 1 X 6 is immediately added to the candidate interaction set and then 
selected, even before X x is selected. Similarly, the interaction AbX 3 is picked up by the 
algorithm after one of its parents X\ is picked up. 



Figure 1: Two hierarchy-preserving solution paths for a toy example produced by the RAMP 
and RAMP-w, respectively. Left Panel: strong hierarchy. Right Panel: weak hierarchy. 
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4 Extension to Generalized Quadratic Regression Mod¬ 
els 

4.1 Generalized Quadratic Regression 

A standard generalized linear model (GLM) assumes that the conditional distribution of 
y given X belongs to the canonical exponential family with density 

fn(y , x > 0) = II ^ o( ^ ; 6 <) = II { c (Vi) ex P ~ } ’ 

where (f> > 0 is a dispersion parameter, [3 = (j3 1 , ...,^ P ) T are the regression coefficients, and 

6 = (9 ll ...,9 n ) T =X/3. (10) 

The function 6(0) is twice continuously differentiable with a positive second-order derivative. 
In sparse high dimensional modeling, (3 is a long vector with a small number of nonzero 
entries. In the context of QR, the design matrix is (X,X° 2 ). A natural generalization of 
GLM is to modify (10) as 

6 = (e u ..., 9 n ) T = X(3 m + X o2 /3 x . (11) 

In the literature, there are very few computational tools available to fit high dimensional 
GQR models. Next, we illustrate how the aforementioned algorithms can be used for GQR. 

4.2 Two-stage Regularization Methods 

For high dimensional data, the penalized likelihood method is commonly used to fit GLM. 
Given the systematic component (10), the penalized likelihood estimator is defined as 

p 

argmin Q n {(3) = argmin -£ n {{3) + V' 

<> p U 

where £ n (/3) = log/ n (y; X, (3) — 1 (y T X/3 — l T b(X/3)) is the log-likelihood up to a scalar, 
J\(-) is a penalty function and A > 0 is the regularization parameter. 


11 



For GQR with systematic component (11), we propose the two-stage approach as follows. 
At Stage 1, only main effects are selected by the penalization method with order-2 terms 
being left out. Denote the selected main-effect set by A. At Stage 2, we expand A by adding 
all the two-way interactions (children) of those main effects (parents) within A and solve 

argminQn(/3,M) = argmin —4(/3, A) + Y] J\((d a ), 

/3 /3 

aeA ° 2 

where 

403,-4) = l [y T (X^ + Xf/3 >2 ) - l T b(X^ + Xf/3^)] • 

At Stage 2, we intentionally do not impose penalty on main effects in A, so that all the 
selected main effects at Stage 1 will stay in the final model. This will assure the hierarchical 
structure of main effects and interactions in the final model. 

4.3 New Path Algorithm for Generalized QR 

The RAMP proposed in Section 3 can be easily extended to fit the GQR. The major 
difference is to replace the penalized least squares by the penalized likelihood function at each 
step. The CDA algorithm is used to minimize the penalized likelihood function iteratively. 

RAMP Algorithm for GQR: 

Initialization-. Set A max = n _1 max |X T y| and A m i n = ()A max with 1 > ( > 0. Generate 
an exponentially decaying sequence A max = Ai > A 2 > • • • > A k = A min . Initialize the main 
effect set = 0 and the interaction effect set X 0 = 0. 

Path-building : Repeat the following steps for k — 1, • • • , K. Given 
add the possible interactions among main effects in A4-i to the current model. Then with 
respect to (/3 0 , f3 T , /3T. 0 2 ) T , we maximize 

4(/3, Mk-i) — A fc ||/3^c_ i || x — A fe 11/3^02^ ||i, 
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where 


£n{fl, Ai-k-l) 


1 


n 


y T + ^°M k _ 1 ^M° k 2 _ 1 ) 

l T b(X A ^ fe _ 1 /3^ (fc _ 1 + X^ fe _ i /3_ M o2_ i ) 


Calculate Aik, Ik and Hk according to the solution. Add the corresponding main effects 
from Ik into Aik to enforce the heredity constraint, and calculate the MLE based on the 
current model. 


5 Numerical Studies 

5.1 Simulation Examples 

We consider data generating processes with varying signal-to-noise ratios, different co¬ 
variate structures, error distributions, and heredity structures. In particular, Example 1 is a 
QR model under ap>n settings with strong heredity considered in Hao & Zhang (2014a). 
Example 2 is a high-dimensional logistic regression model with interaction effects. Examples 
3 and 4 consider QR models with the weak and strong heredity structures respectively, where 
we consider a relatively small p to make the comparison possible with the hierarchical lasso 
(Bien et ah, 2013). Example 5 considers a QR model with a heavy tail error distribution to 
demonstrate the robustness of our methods. 

For comparison, we consider RAMP and two two-stage methods, i.e., two-stage LASSO 
(2-LASSO) and two-stage SCAD (2-SCAD). We also include existing methods iFORT and 
iFORM (Hao & Zhang, 2014a), the hierarchical lasso (Bien et ah, 2013), and the benchmark 
method ORACLE for which the true sparse model is known. 

When computing the solution paths of two-stage methods and RAMP, we choose the 
tuning parameter by EBIC with 7 = 1 (Chen & Chen, 2008). We also implemented other 
parameter tuning criteria including AIC, BIC, and GIC (Fan & Tang, 2013), and observed 
that the EBIC tends to work the best among most of the simulation settings that we con¬ 
sidered. For easy presentation, we report only the results for EBIC. 
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Let S = {j : /3j ^ 0} and T = {(j, k ) : ^ 0} with cardinality s = |<5| and t — \T\. 

For each example, we run M = 100 Monte-Carlo simulations for each method and make 
a comparison. For the m-th simulation, denote the estimated subsets as and 
the estimated coefficient vector as (3^ m \ the main effects and interaction effects as /3j m ' 1 and 
/3j™\ We evaluate the performance on variable selection and model estimation based on the 
following criteria. 

• Main effects coverage percentage (main.cov): C S l ' rn> ). 

• Interaction effects coverage percentage (inter.cov): M _1 ^^ =1 /(T C T^). 

• Main effects exact selection percentage (main.exact): M -1 Yhm =i = 5*^). 

• Interaction effects exact selection percentage (inter.exact): M~ l i = ). 

• Model size (size): M~ l + |T^|). 

. Root, mean squared error (RMSE): {M~' ZLiEJ-odj’” 1 “ft) 2 + EJ.i ELjdi? “ 

ft,t) 2 ]} 1/2 - 

Example 1 Set (n,p, s, t ) = (400, 5000,10,10). Generate the covariates {xj}” =1 ' jV(0, S) 

with Ejfc = 0.5l J_fc l and generate the response y by model (1). S = {1, 2, • • • , 10} with the 
true regression coefficients /3s = (3, 3, 3, 3, 3, 2,2, 2, 2, 2) T . The set of important interac¬ 
tion effects is T = {(1,2), (1,3), (2, 3), (2,5), (3,4), (6, 8), (6,10), (7,8), (7, 9), (9,10)} with 
the corresponding coefficients (2, 2, 2, 2, 2,1,1,1,1,1). 

To have different signal-to-noise ratio situations, we consider o e {2,3,4}. The results 
are summarized in Table 1. With regard to model selection, the proposed RAMP has a high 
coverage percentage in selecting both main effects and interaction effects. The 2-LASSO 
tends to miss some important main effects while picking up some noise variables, ending up 
with the largest model size on average. On the other hand, the 2-SCAD has a high exact 
selection percentage with a low coverage percentage. Compared to RAMP, the iFORM 
tends to have a lower coverage on interaction effects. The iFORT is the worst in terms both 
variable selection and model estimation. With regard to parameter estimation, RAMP has 
the smallest root mean square error (RMSE) when o = 3 and 4. 
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Table 1: Selection and estimation results for Example 1. 



a 

main effects 

coverage exact 

interaction effects 

coverage exact 

size 

RMSE 


2 

1.00 

0.96 

1.00 

0.35 

20.98 

0.87 

RAMP 

3 

0.99 

0.91 

0.83 

0.17 

21.25 

1.29 


4 

0.92 

0.77 

0.47 

0.11 

20.83 

1.96 


2 

0.78 

0.60 

0.78 

0.01 

24.77 

1.56 

2-LASSO 

3 

0.75 

0.56 

0.75 

0.01 

24.64 

1.85 


4 

0.72 

0.51 

0.69 

0.01 

24.40 

2.20 


2 

0.70 

0.58 

0.70 

0.53 

19.92 

1.81 

2-SCAD 

3 

0.69 

0.55 

0.62 

0.26 

20.31 

2.06 


4 

0.65 

0.52 

0.43 

0.14 

20.56 

2.42 


2 

0.00 

0.00 

0.00 

0.00 

14.54 

6.64 

iFORT 

3 

0.00 

0.00 

0.00 

0.00 

13.74 

7.02 


4 

0.00 

0.00 

0.00 

0.00 

12.72 

7.52 


2 

1.00 

0.98 

0.98 

0.40 

20.71 

0.59 

iFORM 

3 

1.00 

0.97 

0.34 

0.17 

19.94 

1.40 


4 

0.97 

0.97 

0.02 

0.01 

18.71 

2.16 


2 

1.00 

1.00 

1.00 

1.00 

20.00 

0.55 

ORACLE 

3 

1.00 

1.00 

1.00 

1.00 

20.00 

0.83 


4 

1.00 

1.00 

1.00 

1.00 

20.00 

1.11 


Example 2 We consider a logistic regression model with 

log jD(\r -TWo = + 3*6 + 3A io + 2>X\Xq + 3X" 6 Ado, 

W* - U|AJ 


where (■ n,p,s,t ) = (400,2000,3,2) and X A/”(0,I p ). For different signal-to-noise ratios, 
we vary the coefficient (3\ G {1, 2, 3}. 


The results are summarized in Table 2, which lead to the following observations. When 
the signal is strong (/3i = 2, 3), RAMP, 2-LASSO and 2-SCAD perform similarly in selecting 
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Table 2: Selection and estimation results for Example 2. 



A 

main effects 

coverage exact 

interaction effects 

coverage exact 

size 

RMSE 


l 

0.92 

0.78 

0.92 

0.91 

4.98 

1.80 

RAMP 

2 

1.00 

0.93 

1.00 

1.00 

5.08 

1.16 


3 

1.00 

0.92 

0.99 

0.99 

5.13 

1.36 


1 

0.45 

0.41 

0.45 

0.14 

4.05 

3.97 

2-LASSO 

2 

1.00 

0.93 

1.00 

0.29 

6.58 

1.41 


3 

1.00 

0.80 

1.00 

0.42 

6.31 

1.66 


1 

0.49 

0.43 

0.49 

0.49 

3.58 

3.76 

2-SCAD 

2 

1.00 

0.81 

1.00 

0.94 

5.28 

1.03 


3 

1.00 

0.74 

1.00 

0.86 

5.52 

1.22 


1 

1.00 

1.00 

1.00 

1.00 

5.00 

0.84 

ORACLE 

2 

1.00 

1.00 

1.00 

1.00 

5.00 

0.78 


3 

1.00 

1.00 

1.00 

1.00 

5.00 

0.83 


main effects; while RAMP and 2-SCAD is much better in selecting interactions than 2- 
LASSO. When the signal is weak (/A = 1), 2-LASSO and 2-SCAD fail to identify the correct 
main effects most of time, which in turn leads to low coverage of important interaction 
effects. On the other hand, RAMP performs reasonably well in terms of selecting both main 
effects and interaction effects. With regard to RMSE, RAMP outperforms 2-LASSO and 
2-SCAD in all scenarios. Note that the iFORT and iFORM are omitted in this example, as 
they do not handle binary responses. 

In the next two examples, we compare RAMP and hierNet algorithms for both strong 
and weak hierarchy scenarios. 


Example 3 Set ( n,p , s, t ) = (400,100,10,10). Generate the covariates {xj}” =1 ' J\f[0, £) 

with E jk = 0.5l J—fc l and generate the response y by model (1). S = {1, 2, • • • , 10} with the 
true regression coefficients (3s = (3, 3, 3, 3, 3, 2, 2,2,2, 2) T . The set of important interaction 
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Table 3: Selection and estimation results as well as average computing time (in seconds) per 
replicate for Example 3. 



a 

main effects 

coverage exact 

interaction effects 

coverage exact 

size 

RMSE 

Time 


2 

1.00 

0.71 

0.00 

0.00 

19.45 

3.54 

37.49 

RAMP 

3 

1.00 

0.83 

0.00 

0.00 

16.86 

3.71 

34.74 


4 

0.98 

0.89 

0.00 

0.00 

15.28 

3.87 

34.88 


2 

1.00 

1.00 

0.99 

0.25 

21.33 

0.79 

47.02 

RAMP-w 

3 

1.00 

0.99 

0.63 

0.12 

21.16 

1.31 

46.51 


4 

1.00 

0.98 

0.16 

0.00 

20.07 

1.98 

46.10 


2 

1.00 

0.00 

1.00 

0.00 

133.45 

5.69 

3143.30 

hicrNet-s 

3 

1.00 

0.00 

0.96 

0.00 

119.62 

5.33 

3232.62 


4 

1.00 

0.00 

0.74 

0.00 

95.06 

5.01 

3507.85 


2 

1.00 

0.00 

1.00 

0.00 

126.83 

6.60 

295.88 

hicrNet-w 

3 

1.00 

0.01 

0.98 

0.00 

96.59 

6.17 

346.83 


4 

1.00 

0.04 

0.75 

0.00 

65.31 

5.73 

444.99 


effects is T = {(1,2), (1,13), (2, 3), (2,15), (3,4), (6,10), (6,18), (7, 9), (7,18), (10,19)} with 
the corresponding coefficients (2, 2, 2, 2, 2,1,1,1,1,1). 

In this example, the strong heredity does not hold while the weak heredity is satisfied. Note 
that we take p to be relatively small due to the heavy computational cost of hicrNet (Bien 
et al., 2013). Here, we compare RAMP and RAMP-w (RAMP with the weak heredity 
constraint) with hierNet-s and hierNet-w, and the results are summarized in Table 3. As 
expected, when applying RAMP with strong heredity (RAMP), it always misses some im¬ 
portant interaction effects. However, the RAMP with weak heredity (RAMP-w) successfully 
recovers the important interaction effects with a high proportion, especially when the error 
variance is small. Comparing with the hicrNet, the RAMP-w in general selects a much 
smaller model with a smaller RMSE. In particular, the computation time of hierNet is much 
longer than RAMP for both the strong and weak versions. 
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Tabic 4: Selection and estimation results as well as average computing time (in seconds) per 
replicate for Example 4. 



o 

main effects 

coverage exact 

interaction effects 

coverage exact 

size 

RMSE 

Time 


2 

1.00 

1.00 

1.00 

0.35 

20.97 

0.86 

34.58 

RAMP 

3 

1.00 

0.98 

0.93 

0.23 

21.31 

1.18 

32.95 


4 

0.97 

0.92 

0.64 

0.10 

21.35 

1.72 

32.28 


2 

1.00 

1.00 

0.99 

0.25 

21.25 

0.87 

56.01 

RAMP-w 

3 

1.00 

1.00 

0.78 

0.18 

21.14 

1.25 

54.58 


4 

1.00 

1.00 

0.30 

0.06 

20.02 

1.92 

53.71 


2 

1.00 

0.00 

1.00 

0.00 

120.99 

5.53 

15847.28 

hierNet-s 

3 

1.00 

0.00 

0.99 

0.00 

115.69 

5.15 

16552.18 


4 

1.00 

0.00 

0.92 

0.00 

90.55 

4.79 

16864.49 


2 

1.00 

0.01 

1.00 

0.00 

97.62 

5.79 

1467.46 

hicrNet-w 

3 

1.00 

0.02 

0.98 

0.00 

61.04 

5.41 

1798.27 


4 

1.00 

0.01 

0.90 

0.00 

53.31 

5.24 

2156.99 


Example 4 Set (■ n,p,s,t ) = (400,200,10,10). The rest setup is same as Example 1. 

In this example, we consider the case where the strong heredity holds and compare RAMP 
and RAMP-w with hierNet-s and hicrNet-w. From Table 4, it is clear that RAMP outper¬ 
forms RAMP-w in terms of both the coverage percentage and the exact selection percentage 
for interaction effect. This is not surprising as the RAMP-w searches for additional interac¬ 
tion effects compared with RAMP. In addition, the RMSE of RAMP is the smallest among 
the four methods throughout all noise levels. Both hierNet-s and hicrNet-w have very good 
coverage percentage but with almost zero exact selection percentage for both main effects 
and interaction effects. As a result, they select a large number of noise variables in the final 
model. Note that the computation time for hierNet-s is over 4 hours for a single replicate. 
As a result, we omit the comparison with hierNet for the other higher dimensional examples. 
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Table 5: Selection and estimation results for Example 5. 



a 

main effects 

coverage exact 

interaction effects 

coverage exact 

size 

RMSE 


2 

1.00 

0.94 

0.98 

0.29 

21.59 

1.02 

RAMP 

3 

0.97 

0.92 

0.84 

0.18 

21.37 

1.52 


4 

0.90 

0.76 

0.49 

0.08 

21.00 

2.33 


Example 5 We use the same setting as in Example 1 except for the error distribution, 
which is changed to a t distribution with degrees of freedom 3. 

This example is designed to examine the robustness of proposed methods under heavy tail 
error distributions. For brevity, we report only the performance of the vanilla RAMP with 
strong heredity enforced. It is clear from Table 5 that under the heavy tail error distribution, 
RAMP has a similar performance as in Example 1. 

5.2 Real Data Example: Supermarket Data 

We consider the supermarket dataset analyzed in Wang (2009) and Hao & Zhang (2014a). 
The data set contains the daily sale information of a major supermarket located in northern 
China, with n = 464 and p — 6, 398. The total number of interaction effects is about 
2.0 x 10'. The response Y is the number of customers on a particular day with the predictor 
X measuring sale volumes of a selection of products. The supermarket manager would like 
to find out which products are most informative in predicting the response, which would be 
useful to design promotions around those products. 

Here, we randomly split the data into a training set (ni = 400) and a test set (n 2 = 64) to 
evaluate the prediction performance of different methods. We also compare the performance 
of RAMP with the regular LASSO without taking interaction effects into account. Because of 
the issue of tuning parameter selection, we report the results using different tuning methods 
including AIC, BIC, EBIC (Chen & Chen, 2008), and GIC (Fan & Tang, 2013) for both 
RAMP and the LASSO. 
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Table 6: Mean selection and prediction results on the supermarket data set over 100 random 
splits. The standard errors are in parentheses. 



RAMP 

LASSO 


size 

size, inter 

R 2 

size 

size, inter 

R 2 

AIC 

229.12(1.68) 

94.53(1.06) 

90.48(0.23) 

264.28(0.91) 

0.00(0.00) 

92.04(0.18) 

BIC 

101.17(3.25) 

34.36(1.65) 

91.18(0.20) 

63.47(0.77) 

0.00(0.00) 

90.76(0.20) 

EBIC 

29.27(1.01) 

3.07(0.29) 

89.67(0.31) 

15.62(0.46) 

0.00(0.00) 

72.09(0.53) 

GIC 

30.71(0.92) 

3.20(0.30) 

90.08(0.28) 

19.19(0.74) 

0.00(0.00) 

75.05(0.58) 


For each random split, we calculate the number of selected variables, the number of se¬ 
lected interaction effects, and the out-of-sample R 2 on the test set. The average performance 
over 100 random splits is presented in Table 6. When we use BIC, EBIC and GIC, RAMP 
selects a model with higher out-of-sample R 2 values than the LASSO. When using more 
stringent tuning parameter criteria like the EBIC and GIC, it is observed that the RAMP 
performs significantly better than the LASSO. For example, when GIC is used, RAMP selects 
30 variables on average with around 3 of them being interaction effects, and has an aver¬ 
age out-of-sample R 2 value of 90.08, which is much higher than the corresponding LASSO 
results. It is clear that by using RAMP with the inclusion of possible interaction effects, 
we can obtain a more interpretable model with a reasonably good prediction performance. 
Moreover, from Table 8 in Hao & Zhang (2014a), the out-of-sample R 2 values with the asso¬ 
ciated standard error for iFORT and iFORM are 88.91 (0.17) and 88.66 (0.18), respectively, 
both of which are outperformed by RAMP with any tuning parameter selection method. 


6 Discussion 

We study regularization methods for interaction selection subject to the marginality 
principle for QR and GQR models. One main advantage of these algorithms is their compu¬ 
tational efficiency and feasibility for high and ultra-high dimensional data. In particular, a 
key feature of RAMP is that it can select main and interaction effects simultaneously while 
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still keeping the hierarchy structure. The strategy of RAMP can be used to extend other 
algorithms, e.g., LARS, to build the entire solution path when fitting the regularized QR 
models. All algorithms considered in this paper utilize the hierarchy structures. Such struc¬ 
tures are natural for quadratic models (Nelder, 1977; Hao & Zhang, 2014b). Nevertheless, in 
certain applications, some main effects may not be strong enough to be selected first without 
incorporating the interaction effects. Other approaches (Zhao et ah, 2009; Yuan et ah, 2009; 
Choi et ah, 2010; Bien et al., 2013) can be applied in this scenario, as these methods keep 
the hierarchy in different ways. However, a drawback is that most of these algorithms are 
relatively slow when p is large. Recently, there have been studies on interaction selection 
which do not rely on the strong or weak hierarchy. Based on the idea of sure independence 
screening (Fan & Lv, 2008; Fan et ah, 2011; Cheng et ah, 2014), Jiang & Liu (2014) proposed 
Sliced Inverse Regression for Interaction Detection (SIRI) for screening interaction variables; 
Fan et ah (2016) introduced a new approach called interaction pursuit for interaction iden¬ 
tification using screening and variable selection. It would be interesting to incorporate these 
screening based methods into our path algorithm to handle general scenarios. 

We demonstrate theoretical properties of the two-stage LASSO method for QR. As a ref¬ 
eree pointed out, selection consistency results on the LASSO often rely on the irrepresentable 
condition, which is not realistic in applications. In order to extend current results, it is de¬ 
sirable to investigate a broad range of penalty functions for GQR, e.g., under frameworks 
similar to Fan & Lv (2011) and Fan & Lv (2013). 

An R package RAMP has been developed and is available from the ORAN website. 


7 Appendix 

The main results are shown in Appendix A, and a related lemma is put in Appendix B. 
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7.1 Appendix A 


Proof of Theorem 1. We will apply the primal-dual witness (PDW) method and use 
(W 1), (W 2), etc. to denote the formula (1), (2),... in Wainwright (2009). Recall in our 
paper, the n -vector uj is the imaginary noise at Stage 1, which is the sum of the Gaussian 
noise e and the interaction effects (u J /3x, ..., u)[/3i) T , and hence it is not independent of the 
design matrix X. 

Part I: Verifying strict dual feasibility. 

The goal is to show that, with overwhelming probability, under condition ( 6 ), inequality 
\Zj\ < 1 holds for each j e S c , where Zj is defined in (W10). For every j e S c , conditional 
on Xs, (W37) gives a decomposition Zj = Aj + Bj where 

^ = E j{x,(xJx 5 )-% + n xi (£-)} 

Bj = Tjjsi'Pss) l T‘S-, 

where Ej = X J — E ilS (E l 5 lS ) _i Xj e M" with E i5 ~ Af( 0, 

Condition (Cl) implies 

max I BA < 1 — 7 . 


Conditioned on X 5 and u;, A ] is Gaussian with mean zero and variance Var(dj) < 
PuifBs-\s)M n where 


n \ n j 


U3 


s \ A „.n 


The following lemma, proved in appendix B, generalizes Lemma 4 in Wainwright (2009). 
Lemma 1 For any e G (0, \), define the event T(e) = {M n > M n (e)}, where 


M n (e) = ( 1 + max < e, 


s 2(a 2 + r 2 ) 


C m i n V n J J C m i n 'U A^ Tl 

Then P(T(e)) < Ci exp(— C 2 minjyTie 2 , s}) for some Ci, C 2 > 0. 
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By Lemma 1, 


p [max|^|>n < Pi max|A,-| >7 


< P ( max \AA > 7 | T°(e)) J + C\ exp(— C 2 min{\/ne 2 , s}). ( 12 ) 
\ieS c J 

Note that the goal is to show the probability in ( 12 ) is exponentially decayed. Conditional 
on T c (e), Var(A,) < p u (T, S c\s)M n (e ), so 


p > 7 1 r w)j < 2b - «)=p 


The assumptions of Theorem 1 imply ^ = o(l) and = o(l), so M n (e) = o(l). 
Therefore, it suffices to show that the decaying rate of the exponential term dominates p — s. 
It is easy to check that ( 6 ) can guarantee that maxjg^c \Zj\ < 1 holds with probability at 
least 1 — ci exp(—C 2 miri{s, log(p — s),n^ }). 

Now we show the sufficiency of the alternative condition ( 8 ). In particular,we show (5) 
and ( 8 ) imply ( 6 ), which is equivalent to 

n ^ r, 1 , ,pu^s*\s) n , 2 (a 2 + r 2 )C min 

—s > 2s ' og{p - s) l^ (1 + -I|J->' 

Plugging in (5), we have 

n 0 I / n P«(^5 c |.s) , r> 1 ( \ Pu(^<S c |<s) 2(cW + t )C m i n 

TTi > 2s ' og(p - a) l^ + 2s ' og{p - s) ^-^ -- 

= 2 S log(p- S )^£> + ^ 1 ^xh. (13) 

Cmi n7 0 p log P 

Following the same argument after (W40) in Wainwright (2009), (13) is implied by ( 8 ) for 

> 2. 

Part II: Sign consistency. 

In order to show sign consistency, we need to show that (W13) holds. That is 


sign(/3j + A j) = sign(/3j), for all j e S, 
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where 


A, = ej 


t I X^X 5 


n 


-1 r 


-xlw 

n s 


Ansign (/3 S ) 


From definition, we have 


where 


max | Aj | < F 1 + F 2 < F 1 + (F 2 i + F 22 ), 
ie*s 


F, = A r 


F 2 = 


XJX 5 


-l 


n 


r T Y _ \ 1 l 


sign (As) 


X^X 5 


F 2j i — 


F 2) 2 — 


n 

r T’ 


-Xlw 

n 5 


n ) n 


-Xjyx 

n ) n 


(W41) and a correction version of (W42) give upper bounds of tail probability of F\ and 
F 21 , respectively. That is 


P 


^Fi > c^\ n 



^ < 4exp(—c 2 min{s, log(p 


a)}), 


(15) 


P 


Fo 1 > 20 



< 4exp(—cis). 


Now we work on the addition term F 2 . 2 . By (WOO), 


P 



< 2exp(— n/2). 



2 


A 11 /3x 11 2 


max 

jeS;(k,e)eT 



(16) 
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iXT(X**X* ) is a sample third moment, so by Lemma B.5 in Hao & Zhang (2014a), 


P( -XT(X fc *X*) > e ) < c 4 exp(-c 5 n3e 2 ). 


Therefore, we have 


Overall, 


P -Xjyj > Wfhht < s 3 c 4 exp(-c 5 n3e 


P (f 2 ,2 > Wflihe ) < s 3 c 6 exp(-c 7 n3e 2 ). 

\ ^min / 

l 

Setting e = we have 

TD f T? ^ 9 ll/3x|| 2 ^ ( \ n-7 

p F 2 , 2 > --— < c 8 exp(-c 9 s). (17 

V C min n 3 / 

Combining (15), (16) and (17), we have that with probability greater than 1 — c\ exp(—c' 2 
minis, log (p — s)}), 


max | Aj| < c :i X n +20i 

jes oo 

Therefore (14) holds when /3 min > g(A n ). □ 


1 9\\PxhVs 

C l ]_ 

min'^ Cm in 77,3 


= g(K)- 


7.2 Appendix B 

Proof of Lemma 1. The first summand of M n can be controlled exactly the same way 
as in Wainwright (2009), i.e., 


lm /XJXA- 1 


Z5 < 1 + 


s \ s 


Crnin V Tl / TlCrr\\ 


with probability at least 1 — 2exp(—s/2). 


Turning to the second summand, we observe that Ll x ± is an orthogonal projection matrix 


and a; = e + yx, so 


2 < HU < 2 IIg||| + llyxlll 
C<s V A n n) 2~ ^n n2 ~ n n 
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Note that Hell!/ 0 " 2 ~ X 2 , by (W54a), 


P (M< (1+e) ^) <«*(_?£). (18) 

Moreover, 

n 

llyxll l-nr 2 = J^(uT (3 X ) 2 - r 2 , 

i =1 

is a sum of mean zero independent random variables. Define B = (B ]k ) is the coefficient 
matrix with Bj k = f3j )k / 2, ( j ^ k) and B J3 = fj ]t] . 

For each i, we can write 

U JPz = x7Bxj - E(x ? T Bx,;) = ej Ae t - tr(A), 

where e, : ~ J\f(0, 1), A = (E)^B(E)^. 

The moment generating function M(t) of the quadratic form Ae, : is 

S 

M(t ) = Ee te * Ae * = det(I - 2iA)"3 = fj(l - (19) 

j=i 

where {Aj}| =1 are eigenvalues of A with ascending order. From (19), we have 
E(e,7Ae;) = tr(A), Var(e7Aej) = 2tr(A 2 ) = r 2 , 


and 

Var ((e7Ae,j — tr(A)) 2 ) = 48tr(A 4 ) + 8tr 2 (A 2 ). 

Define IF) = ^ Ae ^ 2 tl(A b , then E (IF)) = 1, Var (IF)) = 12 77(^7) + 2 < 14. Moreover, 


Ee^ 


< 


|e7” Ae ? - —tr(A) | 

Ee 


t e J Ae j-tr(A) 

i e 7 Ae i—*r( A) 

Ee t r 

+ Ee t 

, tr(A) t . 

. tr(A) — t . 

e- f r M(-) 

+ e* t Af(— ) 


+ 




1 

2 
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where a,j = \ 3 / so 12j=i a j — 1- It is easy to see f—. < 1 + x 2 for x G 
For 0 < t < ^, j \/2taj < I, so both summand in the last formula can be controlled by 


fl 1 + 24 

\j =i 


2 «?) 


<[H(l + ayA)\ < M] e " J 

\j =i / \i=i 


a 2 /4 i I W s , - 1 - - 

a .i> * I = e 2 4 = e 8 . 


Therefore, Ee^l* < 2es for 0 < t < And Ee^” 1 ^ < Ee< Ee t+t ^ < 


O ^ +1 

2e 4 +8 . 


By Lemma B.4 in Hao & Zhang (2014a), 


^(bbj — 1) > ne ) < Ci exp(—c 2 n 2 e 2 ), 

i =1 

for some positive constants ci, c 2 . That is 

P (| 11yx111 — nr 2 | > r 2 ne) < ci exp(—c 2 n5e 2 ), 

which implies 

|yx||l 


^ < (! + e ) r 2 ^ < Cl exp jL C2n l e 2^ 


(18) and (20) imply 


IE 


Cl> 


S \ K.n 


>(! + *) 


2(cr 2 + r 2 ) 
Xin 


< c 3 exp ( —c 4 n 2 e 


( 20 ) 


And the conclusion of Lemma 1 follows. □ 


Supplementary of “Model Selection for High Dimen¬ 
sional Quadratic Regression via Regularization” 

Supplementary A: Theorem 2 

In this supplementary to our paper Hao et ah (2014), we show a generalized version of 
Theorem 1 without Gaussian assumption. Similar as in Hao et al. (2014), constants C \, 
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C* 2 and ci, C 2 ,... are locally defined and may take different values in different sections. We 
start with a brief review of definition of a subgaussian random variable and its properties. 

A random variable X is called ^-subgaussian if for some b > 0, E(e tx ) < e ^ 2 * 2 / 2 for all 
t G M. The set of all subgaussian random variables is closed under linear operation by the 
following proposition. 

Proposition 1 Let X t be bi~subgaussian for i = 1 Then a \Ad + ... + a n X n is B- 

subgaussian with B = )T )” =1 a t \b % . Moreover, if X\,...,X n are independent, a\Xi + ... -\-a n X n 
is B-subgaussian with B = (5 /)/ =1 «?&?) 2 • 

Moreover, the tail probability of a subgaussian variable can be well controlled. 

_t 2 L 

Proposition 2 If X is b-subgaussian, then P(|A| > t) < 2e ^ for all t > 0. Moreover, 
there exists a positive constant, say a = 1/6 b 2 , such that Ee aX2 < 2. 

These well-known results can be found, e.g., in Rivasplata (2012). 

Condition (SG) {x ,}/ =1 are IID random vectors from an elliptical distribution with marginal 
6 -snbgaussian distribution. Moreover, {c,}/ =] are IID with 6 -subgaussian distribution. 

We still use S and denote the covariance matrix of Xj and its submatrix corresponding 
to index sets A and B. B = ( Bj ^.) is the coefficient matrix for interaction effects with 
Bjk = P],k/2, (j ^ k) and Bjj = f3jj. A min (A) and A max (A) denote the smallest and largest 
eigenvalues of a matrix A. We need the following technical conditions: 

(Cl) (Irrepresentablc Condition) HEsc^E ^) -1 < 1 — 7 , 7 G (0,1]. 

(C 2 ) (Eigenvalue Condition) A min (E l 5 l 5 ) > C m in > 0 . 

(C3) (Dimensionality and Sparsity) .slogp = o(n ) and s(logs)5 = o(n 5 ). 

(C4) (Coefficient Matrix) B is sparse and supported in a submatrix B 5lS . A max (B 2 ) = 
A max (B| iS ) < for a positive constant Cb- 



Condition (C3) is employed to replace ( 6 ) in Theorem 1. Similar conditions are standard 
in the literature. Condition (C4) on B is used to control the overall interaction effect, which 
is treated as noise in stage one. A max (B) can be bounded, e.g., by 11/3x 11 1 • 

1 

Theorem 2 Suppose that conditions (SG), (Cl)-(Cf) hold. For X n 3> t (\ogp/n) 2 , with 
probability tending to 1, the LASSO has a unique solution /3 l with support contained within S. 
Moreover, if fd mhl = min ie5 \/3j\ > 2 (s “2 + \\^x\\ 2 /s + X n s^)/C min , then sign(/3 L ) = sign(^). 

Note that 11 /3x 11 2 = tr(£> 2 ) < sC^, so 11 /3x 11 2 / ^ < Cbs - ^. 

Supplementary B: Proof of Theorem 2 

Recall that we use (Wl), (W2),... to denote the formula (1), (2),... in Wainwright (2009). 
The n- vector uj is the imaginary noise at Stage 1, which is the sum of the subgaussian noise 
e and the interaction effects (u J f3x, ...,u J/3i) T . 

Part I: Verifying strict dual feasibility. 

We show that inequality \Zf < 1 holds for each j E S c , with overwhelming probabil¬ 
ity, where Zj is defined in (W10). For every j E S c , conditional on X 5 , (W37) gives a 
decomposition Zj = Aj + Bj where 

^ = Ej{x,(xJx 5 )-% + n xi (£-)} 

Bj = T>j S (Yj SS ) l z s , 

where Ej = Xj — E^Ess^Xj E M n with entries E Vj that is 26-subgaussian by Proposition 
1 and condition (Cl). 

Condition (Cl) implies 

max \BA < 1 — 7 . 
jeS? J ' 


Conditional on X 5 and u:, Aj is 2bMn -subgaussian, where 


Xd-n — Z 5 
n 


xjx 5 


n 


z s + 


ir 


<JL> 

X „n 


We need the following lemma that is proved in Supplementary C. 
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Lemma 2 For any e G (0, \), define the event T(e) = {M n > M n (e)}, where 


MJe) = 


2s 4 (a 2 + r 2 ) 

C mm n A In 


Then P(T(e)) < C\s 2 exp(—Cynic 2 ) for some Ci, C 2 > 0 . 


By Lemma 2, 


< Pi max \Aj\ >7 


< P j^max \Aj\ > 7 | T^e))^ + Cds 2 exp^CV^e 2 ). ( 21 ) 


Conditional on T°{e), A 3 is 2bM\ (e)-subgaussian, so by Proposition 2 


(max \Aj\ > 7 | T° (e))^ < 2 (p - s) exp (- 


7 

8b 2 MJe) 


where the right hand side goes to 0 by condition (C3). Therefore, max^^c \Zf < 1 holds 
with probability tending to 1 . 

Part II: Sign consistency. 


In order to show sign consistency, by Lemma 3 in Wainwright (2009) it is sufficient to 


show 


signed,- + A j) = sign(/3j), for all j e 5, 


where 


A? e j 


T Pts^S 


ixc\ 1 ri 


-X 5 w - A n sign (Ps) 


It is straightforward that 


max IAJ < 
jes 


XjXs 




l 

-X]a> - A n sign (/3 S ) 

n 


-Xje + -Xjy x + ||A„sign(/3 5 )|| 2 

n o n o 
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By Lemma 3, 


xjx 5 


< 2/Cmin; 


with probability at least 1 — s 2 C^ exp(— C^ri/s 2 ). Moreover, 

||A n sign(/3 lS )|| 2 < A n sL 


-Xj yi < \\f3 x \\ 2 max ( -X. T (X fc *X,) ) , 
n 2 j,k/es [ n J J 

where pXj(X*. *X^) is a sample third moment. By Remark B.2 and Lemma B.5 in Hao & 
Zhang (2014a), 


P -Xj(X fc *X^) >e < ciexp(-c 2 n3e 2 ) 


Because |<5| = s, we have 


P -X^yj > ||/3 x || 2 e < s 3 ci exp(-c 2 n,3e 


which, with e = 1/s leads to 


P — X^yj > \\P1W2S < s 3 ciexp(-c 2 n 3 / 


Similarly, 


which, with e = 1 /s leads to 


P ( —X^e > s 2 e ) < SC 3 exp(—cyne 2 ), 


P —Xie >s 2 < sc 3 exp(— c 4 n/s 2 ). 

V n J 


Overall, with probability greater than 1 — C 5 S 3 exp(— CqTis /s 2 ), 

max | Ay| < 2 (s”3 + ||/3x|| 2 s _1 + A tl sO /C min = g(\ n ). 
jes \ / 


Therefore (22) holds when /3 min > g( A n ). □ 
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Supplementary C: Proof of Lemma 2. 

The first summand of M n can be bounded as 

1- T / XjXg V 1 - 2 s 
n s \ n ) s nC min 

with probability at least 1 — s 2 C^ exp(— C^n/s 2 ), where C 3 , C 4 are positive constants. It 
directly follows the fact ||zs||| < s and Lemma 3 in Supplementary D, which says the largest 
eigenvalue of ( —^— ) can be controlled by 2 /C min . 

For the second summand, because II X ± is an orthogonal projection matrix and uj = £+y x , 
we have 

IMI1 < 2 

X 2 n n 2 - A \n n 

As {e ,}” =1 are IID subgaussian, by Proposition 2, and Lemma B.4 in Hao & Zhang 
(2014a), we have 

P < (1 + e)<r 2 ^ < Ci exp (—c 2 ne 2 ) . (23) 

On the other hand, 

n 

\\yi\\l-nT 2 = f3 x ) 2 -t 2 , 

i =1 

is a sum of mean zero independent random variables. 

Define W t = -1, then E(Wj) = 0. By condition (C l), 

uj(3 x = xjBx,; - E(x7Bxj) = (x J :)jB 5lS (xi) lS - E ((x i )jB <S5 (x i ) <s ) . 

So W t is a degree 4 polynomial of subgaussian variables dominated by [(^(x^J(xj)^] 2 , which 
is, up to the constant Cg, a summation of at most s 2 degree 4 monomials of subgaussian 
variables. The tail probability of each of these monomials can be bounded as in Lemma B.5 
in Hao & Zhang (2014a). Therefore, we have 


n. 


u ; 

A n n 


P 




i —1 


> ne < c 3 s 2 exp(—C4U 2 e 2 ), 
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for some positive constants C 3 , c 4 . That is 


p (| llyxlll ~nr 2 1 > r 2 ne) < c 3 s 2 exp(-c 4 n 2 e 2 ), 


which implies 


(23) and (24) imply 


if 


^jlwjjy < ^ e ) r 2 ^ < c 3 s 2 exp (^—c 4 n 2 e 2 ^j . 


(24) 


Cl) 


S \ K,n 


— (1 + e) 


2(cr 2 + T 2 ) 
A In 


< C 5 S exp ( —can 2 e ) , 


for some positive constants C5, Cq. With e = 1, the conclusion of Lemma 2 follows. □ 

Supplementary D: Lemma 3 and its proof. 

Lemma 3 Under conditions (SG) and (C3), we have 


P A 


Xjx 5 


n 


> C'min/2 ) > 1 — s 2 C 3 exp(— C 4 n/s 2 ) —> 1, 


where (7 m j n — A m ; n (S l 5 l 5 ), C 3 > 0, C 4 > 0. 


Proof. We need bound 


P 



(S55 


XjX 5 /n)v | 


> e 


(25) 


For easy presentation, we assume that the s-vector v is indexed by S. Then 


|v T (S^-XjX 5 /n)v| 

^ ^ ^ \ v j v k\\^jk X^- X^/fi 

j,kes 

< Il v ll?max|E jfc -X T X fc /n| 

j,kes J 

< s max | S — Xj Xfc /n | 

j,kes J 


So (25) is bounded from above by 


P max iS.-fc 
\j,kes 



(26) 
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Following Remark B.2 and Lemma B.5 in Hao & Zhang (2014a), it is easy to derive 


P (|Ejfc - XjXfc/n| > e) < C 3 exp(-C 5 ne 2 ), 

for constants C 3 > 0, C 5 > 0 under subgaussian assumption. Therefore, (26) is further 
bounded by s 2 C 3 exp(— C 5 ne 2 /s 2 ). Take e = min{C min / 2 , 1 / 2 }, we have 

P ^A min > Cmin/2^ > 1 - s 2 C 3 exp(—C 4 n/s 2 ) -A 1, 

by condition (C3), where C 4 = C' 5 (min{C' m i n / 2 , 1 / 2 }) 2 . 
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